Language Model Combination and Adaptation Using Weighted Finite State Transducers
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. These can be used either to represent different genres or tasks found in diverse text sources, or to capture the stochastic properties of different linguistic symbol sequences, for example syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, only minimal changes to decoding tools are needed, and a wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences.
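As a simplified illustration of combining multiple n-gram models (not the paper's WFST implementation, which composes transducers inside the decoder), the effect of merging genre-specific LMs can be sketched as weighted linear interpolation of their probability tables. The model names and probabilities below are invented for the example.

```python
# Toy sketch: combine word probability tables by linear interpolation.
# This approximates LM combination; the paper performs it with WFST
# operations so the decoder itself needs no changes.

def interpolate_lms(lms, weights):
    """Weighted linear interpolation of word-probability dictionaries."""
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = set()
    for lm in lms:
        vocab.update(lm)
    return {w: sum(wt * lm.get(w, 0.0) for lm, wt in zip(lms, weights))
            for w in vocab}

# Two toy unigram "genre" models (invented)
news_lm = {"market": 0.4, "report": 0.6}
sport_lm = {"match": 0.7, "report": 0.3}

combined = interpolate_lms([news_lm, sport_lm], [0.5, 0.5])
```

With equal weights, words shared by both genres ("report") receive the average of the two probabilities, while genre-specific words are down-weighted accordingly.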
Adapting an Unadaptable ASR System
As speech recognition model sizes and training data requirements grow, it is
increasingly common for systems to only be available via APIs from online
service providers rather than having direct access to models themselves. In
this scenario it is challenging to adapt systems to a specific target domain.
To address this problem we consider the recently released OpenAI Whisper ASR as
an example of a large-scale ASR system to assess adaptation methods. An error
correction based approach is adopted, as this does not require access to the
model, but can be trained from either 1-best or N-best outputs that are
normally available via the ASR API. LibriSpeech is used as the primary target
domain for adaptation. The generalization ability of the system is then evaluated in two distinct dimensions: first, whether the form of correction model is portable to other speech recognition domains, and second, whether it can be used for ASR models having a different architecture.
Comment: submitted to INTERSPEECH
Adapting an ASR Foundation Model for Spoken Language Assessment
A crucial part of an accurate and reliable spoken language assessment system
is the underlying ASR model. Recently, large-scale pre-trained ASR foundation
models such as Whisper have been made available. As the output of these models
is designed to be human readable, punctuation is added, numbers are presented
in Arabic numeric form and abbreviations are included. Additionally, these
models have a tendency to skip disfluencies and hesitations in the output.
Though useful for readability, these attributes are not helpful for assessing
the ability of a candidate and providing feedback. Here a precise transcription
of what a candidate said is needed. In this paper, we give a detailed analysis
of Whisper outputs and propose two solutions: fine-tuning and soft prompt
tuning. Experiments are conducted on both public speech corpora and an English
learner dataset. Results show that we can effectively alter the decoding
behaviour of Whisper to generate the exact words spoken in the response.
Comment: Proceedings of SLaTE
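To make the mismatch concrete, the "human readable" conventions the abstract lists (punctuation, Arabic numerals) can be normalised away post hoc, as a toy illustration only; the paper instead alters Whisper's decoding behaviour directly. The number lookup below is a deliberately tiny, invented stand-in.

```python
# Illustrative only: map readable-style ASR output toward a verbatim-style
# transcript for assessment. Not the paper's method (fine-tuning / soft
# prompt tuning); just a demonstration of the output-style gap.
import re

NUM_WORDS = {"2": "two", "3": "three", "10": "ten"}  # toy lookup, not exhaustive

def to_verbatim_style(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)              # strip punctuation
    words = [NUM_WORDS.get(w, w) for w in text.split()]
    return " ".join(words)

out = to_verbatim_style("Well, I have 2 answers.")
```

Note what such a normaliser cannot do: it can never restore disfluencies and hesitations that the model skipped, which is why the paper changes the decoding itself.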
N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space
Error correction models form an important part of Automatic Speech
Recognition (ASR) post-processing to improve the readability and quality of
transcriptions. Most prior works use the 1-best ASR hypothesis as input and
therefore can only perform correction by leveraging the context within one
sentence. In this work, we propose a novel N-best T5 model for this task, which
is fine-tuned from a T5 model and utilizes ASR N-best lists as model input. By
transferring knowledge from the pre-trained language model and obtaining richer
information from the ASR decoding space, the proposed approach outperforms a
strong Conformer-Transducer baseline. Another issue with standard error
correction is that the generation process is not well-guided. To address this, a constrained decoding process, either based on the N-best list or an ASR lattice, is used, which allows additional information to be propagated.
Comment: submitted to INTERSPEECH
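A hedged sketch of N-best-constrained generation: restricting each decoding step's candidates to words that appear in the ASR N-best list, so the corrector cannot produce out-of-list words. The greedy loop and word scorer below are invented stand-ins for the T5 model used in the paper.

```python
# Toy constrained decoder: the candidate vocabulary at every step is the
# union of words over the N-best hypotheses (a much simpler constraint
# than a lattice, but the same idea).

def constrained_decode(nbest, score_word, length):
    """Greedy decode over the vocabulary of the N-best hypotheses."""
    allowed = sorted({w for hyp in nbest for w in hyp.split()})
    output = []
    for _ in range(length):
        output.append(max(allowed, key=lambda w: score_word(output, w)))
    return " ".join(output)

nbest = ["the cat sat", "a cat sat", "the cat sad"]

# Invented scorer standing in for a seq2seq model: follow hypothesis 1.
ref = "the cat sat".split()
def score_word(prefix, w):
    i = len(prefix)
    return 1.0 if i < len(ref) and ref[i] == w else 0.0

decoded = constrained_decode(nbest, score_word, 3)
```

Because "allowed" is built from the N-best lists, any word outside the ASR decoding space is unreachable, which is the guidance the abstract says standard error correction lacks.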
Successive loadings of reactant in the hydrogen generation by hydrolysis of sodium borohydride in batch reactors
In this paper, for the first time, an experimental investigation is presented of five successive loadings of a reactant alkaline solution of sodium borohydride (NaBH4) for hydrogen generation, using an improved nickel-based powder catalyst, under uncontrolled ambient conditions. The experiments were performed in two batch reactors with internal volumes of 0.646 L and 0.369 L. The compressed hydrogen generated, at pressures below the hydrogen critical pressure, emphasizes the importance of considering solubility effects during reaction, which lead to storage of hydrogen in the liquid phase inside the reactor. The present work suggests that the sodium metaborate by-product formed by the alkaline hydrolysis of NaBH4, in a closed pressure vessel without temperature control, is NaBO2.xH2O, with x ≥ 2. The data obtained in this work lend credence to x ≈ 2, which is discussed based on the XRD results, and this calls for increased caution in the definition of the hydrolysis reaction of NaBH4 up to temperatures of 333 K and pressures of 0.13 MPa.
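A back-of-envelope check (not from the paper's data) of why NaBH4 hydrolysis is attractive for hydrogen generation: the reaction NaBH4 + (2 + x) H2O -> NaBO2.xH2O + 4 H2 releases four moles of H2 per mole of borohydride. Standard atomic masses are used; reactor conditions and solubility effects are ignored.

```python
# Gravimetric hydrogen yield of NaBH4 hydrolysis per gram of NaBH4,
# counting only the borohydride (water mass excluded for simplicity).

M = {"H": 1.008, "B": 10.81, "O": 16.00, "Na": 22.99}  # g/mol

m_nabh4 = M["Na"] + M["B"] + 4 * M["H"]   # ~37.83 g/mol
m_h2 = 2 * M["H"]                         # 2.016 g/mol
h2_per_gram = 4 * m_h2 / m_nabh4          # g H2 per g NaBH4, ~0.213
```

About 0.21 g of H2 per gram of NaBH4, i.e. roughly 21 wt% on the borohydride alone; the water consumed by the reaction, and the hydration number x of the metaborate by-product discussed above, reduce the practical system-level yield.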
Blind Normalization of Speech From Different Channels
We show how to construct a channel-independent representation of speech that
has propagated through a noisy reverberant channel. This is done by blindly
rescaling the cepstral time series by a non-linear function, with the form of
this scale function being determined by previously encountered cepstra from
that channel. The rescaled form of the time series is an invariant property of
it in the following sense: it is unaffected if the time series is transformed
by any time-independent invertible distortion. Because a linear channel with
stationary noise and impulse response transforms cepstra in this way, the new
technique can be used to remove the channel dependence of a cepstral time
series. In experiments, the method achieved greater channel-independence than
cepstral mean normalization, and it was comparable to the combination of
cepstral mean normalization and spectral subtraction, despite the fact that no
measurements of channel noise or reverberations were required (unlike spectral
subtraction).
Comment: 25 pages, 7 figures
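A simplified one-dimensional analogue of the rescaling idea (hedged: the paper operates on cepstral time series and claims invariance to any time-independent invertible distortion; the rank mapping below is invariant only to monotone increasing ones): mapping each sample to its empirical CDF value yields a representation unchanged by such a channel distortion.

```python
# Toy blind normalisation: replace each value by its empirical rank
# fraction. A monotone increasing "channel" transform leaves this
# representation unchanged, since it preserves the ordering of samples.

def rank_normalize(series):
    order = sorted(series)
    n = len(series)
    return [(order.index(x) + 1) / n for x in series]  # values assumed distinct

clean = [0.1, 0.5, 0.3, 0.9]
distorted = [2 * x + 1 for x in clean]   # an invertible, monotone distortion
a = rank_normalize(clean)
b = rank_normalize(distorted)
```

Here `a` and `b` are identical even though the raw series differ, which is the sense in which the rescaled form is an invariant property of the time series.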
Zero-shot Audio Topic Reranking using Large Language Models
The Multimodal Video Search by Examples (MVSE) project investigates using
video clips as the query term for information retrieval, rather than the more
traditional text query. This enables far richer search modalities such as
images, speaker, content, topic, and emotion. A key element of this process is highly rapid, flexible search over large archives, which in MVSE is
facilitated by representing video attributes by embeddings. This work aims to
mitigate any performance loss from this rapid archive search by examining
reranking approaches. In particular, zero-shot reranking methods using large
language models are investigated as these are applicable to any video archive
audio content. Performance is evaluated for topic-based retrieval on a publicly
available video archive, the BBC Rewind corpus. Results demonstrate that
reranking can achieve improved retrieval ranking without the need for any
task-specific training data.
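The two-stage pattern the abstract describes can be sketched generically: a fast first-pass retrieval (embeddings in MVSE) produces candidates, and a slower, stronger relevance scorer reorders them. The keyword-overlap scorer below is an invented stand-in for the zero-shot LLM judgement; all documents are made up.

```python
# Generic rerank step: reorder first-pass candidates by a stronger scorer.

def rerank(query, candidates, scorer, top_k=None):
    """Sort candidates by descending relevance under the stronger scorer."""
    ranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
    return ranked[:top_k] if top_k else ranked

def overlap_scorer(query, doc):      # stand-in for an LLM relevance call
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

first_pass = ["weather forecast for London",
              "BBC archive football highlights",
              "football match report"]
result = rerank("football match", first_pass, overlap_scorer)
```

Because the expensive scorer only sees the short first-pass list, the archive-scale search stays fast while the final ranking improves, which is the trade-off the abstract targets.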
IMPROVING MULTIPLE-CROWD-SOURCED TRANSCRIPTIONS USING A SPEECH RECOGNISER
This paper introduces a method to produce high-quality transcriptions of speech data from only two crowd-sourced transcriptions. These transcriptions, produced cheaply by people on the Internet, for example through Amazon Mechanical Turk, are often of low quality. Often, multiple crowd-sourced transcriptions are combined to form one transcription of higher quality. However, the state of the art is essentially a form of majority voting, which requires at least three transcriptions for each utterance. This paper shows how to refine this approach to work with only two transcriptions. It then introduces a method that uses a speech recogniser (bootstrapped on a simple combination scheme) to combine transcriptions. When only two crowd-sourced transcriptions are available, on a noisy data set this improves the word error rate against gold-standard transcriptions by 21% relative.
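A minimal sketch of the combination idea (not the paper's system): align the two crowd-sourced transcriptions word-by-word, and where they disagree, prefer the variant that also appears in the speech recogniser's output. The example sentences and tie-breaking rule are invented.

```python
# Combine two transcriptions, using a recogniser's 1-best as the tiebreaker
# where the two human transcribers disagree.
import difflib

def combine(t1, t2, asr):
    w1, w2, wa = t1.split(), t2.split(), set(asr.split())
    out = []
    sm = difflib.SequenceMatcher(a=w1, b=w2)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.extend(w1[i1:i2])
        elif (i2 - i1) == (j2 - j1):   # same length: vote word-by-word
            for a, b in zip(w1[i1:i2], w2[j1:j2]):
                out.append(a if a in wa else b)
        else:
            out.extend(w1[i1:i2])      # fall back to the first transcriber
    return " ".join(out)

merged = combine("the quick brown focks", "the quick browne fox",
                 asr="a quick brown fox")
```

With only two transcriptions there is no majority to vote with, so the recogniser supplies the third opinion; that is the gap in plain majority voting that the abstract identifies.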
Antimicrobial activity of a library of thioxanthones and their potential as efflux pump inhibitors
The overexpression of efflux pumps is one of the causes of multidrug resistance, which leads to the inefficacy of drugs. This plays a pivotal role in antimicrobial resistance, and the most notable pumps are the AcrAB-TolC system (AcrB belongs to the resistance-nodulation-division family) and NorA, from the major facilitator superfamily. In bacteria, these structures can also favor virulence and adaptation mechanisms, such as quorum-sensing and the formation of biofilm. In this study, the design and synthesis of a library of thioxanthones as potential efflux pump inhibitors are described. The thioxanthone derivatives were investigated for their antibacterial activity and inhibition of efflux pumps, biofilm formation, and quorum-sensing. The compounds were also studied for their potential to interact with P-glycoprotein (P-gp, ABCB1), an efflux pump present in mammalian cells, and for their cytotoxicity in both mouse fibroblasts and human Caco-2 cells. The results concerning the real-time ethidium bromide accumulation may suggest a potential bacterial efflux pump inhibition, which has not yet been reported for thioxanthones. Moreover, in vitro studies in human cells demonstrated a lack of cytotoxicity for concentrations up to 20 µM in Caco-2 cells, with some derivatives also showing potential for P-gp modulation.
This research was supported by national funds through FCT (Foundation for Science and Technology) within the scope of UIDB/04423/2020 and UIDP/04423/2020 (Group of Natural Products and Medicinal Chemistry-CIIMAR), and under the project PTDC/SAU-PUB/28736/2017 (reference POCI-01-0145-FEDER-028736), co-financed by COMPETE 2020, Portugal 2020 and the European Union through the ERDF, and by FCT through national funds and the structured programme of R&D&I ATLANTIDA (NORTE-01-0145-FEDER-000040), supported by NORTE2020, through ERDF, and CHIRALBIO ACTIVE-PI-3RL-IINFACTS-2019.